Validated Pretreatment Prediction Models for Response to Neoadjuvant Therapy in Patients with Rectal Cancer: A Systematic Review and Critical Appraisal

Simple Summary
Organ preservation strategies can be offered to patients with rectal cancer who show a strong response to preoperative treatment in order to avoid major surgery. Prediction models may help identify these patients before the start of preoperative treatment, when treatment can still be adapted. We systematically reviewed validated pretreatment prediction models for response to preoperative treatment in patients with rectal cancer. Sixteen studies were included in this review. All studies were considered to have a high risk of bias and external validation was missing. Nevertheless, some studies show promising results, which could serve as a foundation for future research. Our systematic review provides a comprehensive overview of the current state of the literature regarding pretreatment prediction models for response to preoperative treatment in patients with rectal cancer.

Abstract
Pretreatment response prediction is crucial to select those patients with rectal cancer who will benefit from organ preservation strategies following (intensified) neoadjuvant therapy and to avoid unnecessary toxicity in those who will not. The combination of individual predictors in multivariable prediction models might improve predictive accuracy. The aim of this systematic review was to summarize and critically appraise validated pretreatment prediction models (other than radiomics-based models or image-based deep learning models) for response to neoadjuvant therapy in patients with rectal cancer and provide evidence-based recommendations for future research. MEDLINE via Ovid, Embase.com, and Scopus were searched for eligible studies published up to November 2022. A total of 5006 studies were screened and 16 were included for data extraction and risk of bias assessment using the Prediction model Risk Of Bias Assessment Tool (PROBAST). All selected models were unique and grouped into five predictor categories: clinical, combined, genetics, metabolites, and pathology. Studies generally included patients with intermediate or advanced tumor stages who were treated with neoadjuvant chemoradiotherapy. Evaluated outcomes were pathological complete response and pathological tumor response. All studies were considered to have a high risk of bias and none of the models were externally validated in an independent study. Discriminative performances, estimated with the area under the curve (AUC), ranged per predictor category from 0.60 to 0.70 (clinical), 0.78 to 0.81 (combined), 0.66 to 0.91 (genetics), 0.54 to 0.80 (metabolites), and 0.71 to 0.91 (pathology). Model calibration outcomes were reported in five studies. Two collagen feature-based models showed the best predictive performance (AUCs 0.83–0.91 and good calibration). In conclusion, some pretreatment models for response prediction in rectal cancer show encouraging predictive potential but, given the high risk of bias in these studies, their value should be evaluated in future, well-designed studies.


Introduction
Due to optimization of rectal cancer diagnosis and treatment, most notably by the introduction of total mesorectal excision (TME) and neoadjuvant therapy, local recurrence rates have been significantly reduced [1,2]. Neoadjuvant therapy is usually given for more advanced tumors, as either long-course chemoradiotherapy (CRT) or short-course radiotherapy (SCRT), to obtain downsizing of the tumor. More recently, total neoadjuvant treatment (TNT) has emerged as an alternative to decrease the number of distant metastases without compromising local control [2][3][4][5]. The positive effect of neoadjuvant therapy followed by TME on oncological outcomes has been widely recognized, although it is also associated with significant acute and long-term gastro-intestinal, urinary, and sexual morbidity [6,7].
A selected group of patients with rectal cancer shows no residual viable tumor cells in the resected specimen after neoadjuvant therapy, known as a pathological complete response (pCR) [8]. Preoperatively, a clinical complete response (cCR) can be defined with high accuracy based on digital rectal examination, rectoscopy, and MRI [9]. Patients with a cCR can be offered a "watch and wait" (W&W) strategy to avoid major surgery. Some patients with a near-cCR (ncCR) can be treated with additional treatment to avoid TME, such as local excision or contact X-ray brachytherapy (Papillon) [10][11][12], followed by a W&W approach if a complete response is achieved. Accumulating evidence shows that a W&W strategy might be considered as a safe alternative to TME surgery [6,13,14,15]. Furthermore, the W&W strategy is associated with favorable functional outcomes and good quality of life [16,17], and avoids a permanent stoma in patients with a very low tumor [18,19]. For some patients, the W&W strategy may be the optimal treatment option, and carefully comparing the advantages and challenges of this strategy versus standard TME should be part of a shared decision-making process.
The chance of achieving a complete response after neoadjuvant therapy is partly dependent on the neoadjuvant treatment regimen. In patients with locally advanced rectal cancer (LARC), CRT results in pCR rates of 10-30% [3], whereas TNT increases this chance [2][3][4][5], with better outcomes after the use of consolidation therapy versus induction therapy [20,21]. Nevertheless, TNT is associated with a higher risk of toxicity [20], which may impair patients' quality of life. Careful selection of patients for type of neoadjuvant regimen is therefore essential.
Eligibility for a W&W strategy is assessed after completion of neoadjuvant therapy. Response assessment is generally recommended 12 weeks after the start of neoadjuvant CRT or SCRT and, in case of a ncCR, again at 16-20 weeks [22]. For TNT, timing of response assessment depends on the duration of the treatment and could take up to 38 weeks [22]. Ideally, selection of eligible patients for a W&W strategy is performed before the start of neoadjuvant therapy. Pretreatment response prediction would allow for personalized neoadjuvant approaches based on individual patient characteristics. This is particularly important now that neoadjuvant therapy is intensified and more often used in patients with early tumor stages, with the intention of achieving a higher rate of organ preservation [23][24][25][26][27][28][29]. In order to select patients who will benefit the most from (intensified) neoadjuvant therapy and avoid unnecessary toxicity in those who are ineligible for W&W but still require TME surgery, pretreatment response prediction is critical.
However, pretreatment prediction of response is highly challenging, as rectal cancer is a heterogeneous disease and response to neoadjuvant therapy can vary greatly, indicating a complex relationship between tumor characteristics and treatment response [30,31]. As a result, no individual pretreatment predictor currently has the ability to accurately select all eligible patients for organ preservation [31]. To improve predictive accuracy, individual pretreatment predictors can be combined in multivariable prediction models. Recently, several publications have shown the predictive potential of radiomics-based models (which extract numerous quantitative features from medical images [32]) and image-based deep learning models [32][33][34]. Other promising candidate predictors for pretreatment models include conventional clinical factors that are commonly measured during routine diagnostic work-up, like cTNM-stage and carcinoembryonic antigen (CEA) [31,35], or more novel predictors such as mismatch-repair (MMR) status now that immunotherapy shows encouraging results in MMR-deficient tumors [21]. At the moment, an overview of pretreatment prediction models that combine predictors other than image-based features is lacking.
In this systematic review, we aim to summarize and critically appraise validated pretreatment prediction models, other than radiomics-based models or image-based deep learning models, for response to neoadjuvant therapy in patients with rectal cancer and provide evidence-based recommendations for future research.

Reporting
This systematic review is reported according to the "Preferred Reporting Items for Systematic reviews and Meta-Analyses" (PRISMA) statement [36] and the PRISMA extension for searching (PRISMA-S) [37]. The research protocol was registered in PROSPERO (CRD42023385057).

Search Strategy
MEDLINE via Ovid, Embase.com, and Scopus were searched by an information specialist (SvdM). The schematic search was as follows: Rectal Cancer AND Response AND (Neoadjuvant therapy OR CRT/SCRT/TNT). Free text terms including synonyms were used. In addition, thesaurus terms were used for MEDLINE (MeSH) and Embase (Emtree). See Supplementary Table S1 for the complete search. No filters, restrictions, or limits were used, except for the exclusion of references published prior to 1 January 2012. The search was not based on previous work and was reviewed by a second information specialist. It was originally executed on 9 June 2022 and updated on 18 November 2022. All searches were performed separately on their respective platforms (i.e., no federated searches). The results were deduplicated in EndNote 20 [38], first based on PMID and second on DOI, after which a manual check was performed. Forward and backward citation chasing ("snowballing") was performed for the included articles. No other databases, registries, or alternative methods were used.
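The two automated deduplication passes described above (first on PMID, then on DOI) can be illustrated with a minimal Python sketch. This is not the EndNote procedure itself: the record layout (dictionaries with hypothetical `pmid` and `doi` keys) and the function names are assumptions made for illustration only, and the final manual check is not modeled.

```python
def drop_duplicates_by(records, key):
    """Keep the first record for each non-empty value of `key`.

    Records without a value for `key` are always kept, because they
    cannot be matched on that identifier.
    """
    seen = set()
    kept = []
    for rec in records:
        value = (rec.get(key) or "").strip().lower()
        if value and value in seen:
            continue  # same identifier already kept: treat as duplicate
        if value:
            seen.add(value)
        kept.append(rec)
    return kept


def deduplicate(records):
    """Two passes in the order described in the text: PMID first, then DOI."""
    after_pmid = drop_duplicates_by(records, "pmid")
    return drop_duplicates_by(after_pmid, "doi")


# Hypothetical example records (not taken from the actual search results)
example = [
    {"pmid": "1", "doi": "10.1/a"},
    {"pmid": "1", "doi": None},      # duplicate on PMID
    {"pmid": None, "doi": "10.1/A"}, # duplicate on DOI (case-insensitive)
    {"pmid": "2", "doi": "10.1/b"},
]
unique = deduplicate(example)
```

Records that share neither identifier pass through both steps, which is why a manual check remains necessary.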

General Inclusion and Exclusion Criteria
Studies investigating a multivariable (at least two predictors) pretreatment prediction model for response to neoadjuvant therapy were selected if they met the following criteria: (1) rectal cancer patients without metastases (cT1-4N0-2M0), (2) adenocarcinoma, (3) neoadjuvant radiotherapy with or without additional systemic therapy, (4) outcome defined as either pCR, pathological tumor response, cCR, or ncCR followed by additional successful organ-preserving treatment, (5) at least one development or validation cohort with a sample size of 50 or more, and (6) any form of internal or external validation. The following study types were excluded: studies that included patients with mucinous, signet-ring cell, or neuroendocrine tumors, studies including radiomics-based models or image-based deep learning models, non-English studies, studies published in non-peer-reviewed journals, conference abstracts, case reports, preclinical studies, (systematic) reviews, and meta-analyses.

Development and Validation Studies
Studies with type 1b-4 models, according to the "transparent reporting of a multivariable prediction model for individual prognosis or diagnosis" (TRIPOD) statement, were selected (Table 1) [39]. Type 1a studies do not include any form of validation and were excluded from this review. Type 1b-2 studies use a single dataset for model development and validation. Type 1b and 2a are internal validation studies. Type 1b studies use resampling methods (e.g., cross-validation or bootstrapping) and type 2a studies randomly split the data into a development and validation cohort. Type 2b studies nonrandomly split the data into a development and validation cohort, for instance, based on location (e.g., different hospital) or time (different moment of inclusion). Because of the nonrandom variation, type 2b can be seen as intermediary between internal and external validation, according to TRIPOD [39]. Type 3 and 4 are external validation studies. Type 3 studies use separate development and validation cohorts; type 4 studies do not develop a model but only validate an existing one.
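The typology above can be summarized as a small decision function. The following Python sketch only encodes the distinctions described in this section; the function name and its parameters are hypothetical and are not part of the TRIPOD statement.

```python
# Validation category per TRIPOD type, as described in the text above
VALIDATION_CATEGORY = {
    "1a": "no validation (excluded from this review)",
    "1b": "internal validation",
    "2a": "internal validation",
    "2b": "intermediate between internal and external validation",
    "3": "external validation",
    "4": "external validation",
}


def tripod_type(develops_model, separate_validation_data=False,
                split=None, resampling=False):
    """Return the TRIPOD type for a study design (illustrative only).

    develops_model: the study develops a new model
    separate_validation_data: a separate dataset is used for validation
    split: None, "random", or "nonrandom" split of a single dataset
    resampling: cross-validation or bootstrapping on a single dataset
    """
    if not develops_model:
        return "4"   # validation of an existing model only
    if separate_validation_data:
        return "3"   # separate development and validation cohorts
    if split == "nonrandom":
        return "2b"  # e.g., split by location or time
    if split == "random":
        return "2a"
    if resampling:
        return "1b"
    return "1a"      # development only, no validation


study = tripod_type(develops_model=True, split="nonrandom")
category = VALIDATION_CATEGORY[study]
```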

Article Selection and Data Extraction
Using the eligibility criteria of Section 2.3, two independent reviewers (MDT and BMG) screened titles and abstracts, using Rayyan [40], to identify potential articles. The same independent reviewers performed subsequent full-text eligibility assessment. Reasons for exclusion of ineligible articles were recorded. Included articles had to be approved by both reviewers. Disagreements were initially resolved by discussion between the reviewers and, if necessary, with help from a third reviewer (AMC). The data extraction form was based on the TRIPOD statement [39], the 11 domains of the "Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies" (CHARMS) checklist [41], and the "Prediction model Risk Of Bias Assessment Tool" (PROBAST) [42]. Completed data extraction forms of the first five articles were discussed by MDT, BMG, and AMC to ensure that no relevant data were missed. Data extraction of the remaining articles was performed by MDT. Authors were not contacted in case of missing data. The data extraction form and extracted data have not been made publicly available.

Risk of Bias Assessment
PROBAST was used for risk of bias (ROB) assessment and evaluation of concerns regarding the applicability of a model [42,43]. ROB is assessed across four domains: participants, predictors, outcome, and analysis. Applicability is assessed across the first three domains. Each domain contains specific signaling questions to help with the final ROB and applicability rating; some questions are only related to model development studies. As suggested by PROBAST, a prediction model was considered to have high risk of bias or high concerns regarding applicability if at least one domain was judged as high. An unclear rating was given if at least one domain was judged as unclear and none of the other domains were judged as high. Consequently, a prediction model was only judged as low risk if all domains were graded as such. PROBAST forms of included articles were completed by MDT and discussed by MDT, BMG, and AMC.
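The overall judgment rule described above can be written as a short function. This is a minimal Python sketch of that rule only; the function name and the dictionary layout are assumptions for illustration and do not reproduce the PROBAST signaling questions.

```python
def overall_rating(domain_ratings):
    """Combine per-domain ratings into an overall judgment.

    domain_ratings maps a domain name to "low", "high", or "unclear".
    High if any domain is high; unclear if any domain is unclear and
    none is high; low only if all domains are low.
    """
    ratings = [r.lower() for r in domain_ratings.values()]
    for r in ratings:
        if r not in {"low", "high", "unclear"}:
            raise ValueError(f"unknown rating: {r}")
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"


# Hypothetical example: low ROB in three domains, high in the analysis domain
rob = overall_rating({
    "participants": "low",
    "predictors": "low",
    "outcome": "low",
    "analysis": "high",
})
```

The same function applies to the applicability judgment, using only the participants, predictors, and outcome domains.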

Data Synthesis
For each prediction model, relevant information and ROB assessment were summarized in a descriptive synthesis, supported by tables and figures. Predictive performance measures that were reported included model discrimination and calibration. Discrimination is often quantified and reported with the area under the curve (AUC). The apparent AUC (discrimination of the model on the development cohort before internal validation) and validated AUCs were reported. An AUC of 1.0 indicates perfect discrimination. As a general rule of thumb, AUC values can be considered as poor (0.5-0.7), moderate (0.7-0.8), good (0.8-0.9), or excellent (0.9-1.0) [44]. Calibration is commonly evaluated with a goodness-of-fit test, such as the Hosmer-Lemeshow test (a p-value < 0.05 indicates miscalibration [45]), or graphically with a calibration curve, which can be quantified by the calibration slope and intercept. If a study included multiple prediction models, results of the model with the best discrimination were reported. Prediction models were categorized according to type of included predictors. No meta-analysis was performed due to expected heterogeneity.
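To make the discrimination measure concrete, the following Python sketch computes the AUC as the proportion of responder/non-responder pairs in which the responder receives the higher predicted probability (ties count as half), and then applies the rule-of-thumb categories. The function names and example values are hypothetical. Because the listed ranges share their boundaries, the sketch assumes that a boundary value belongs to the higher category (for example, 0.70 is labeled moderate).

```python
def auc(scores_pos, scores_neg):
    """Pairwise (concordance) estimate of the area under the ROC curve.

    scores_pos: predicted probabilities for patients with the outcome
    scores_neg: predicted probabilities for patients without the outcome
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one patient")
    concordant = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half
    return concordant / (len(scores_pos) * len(scores_neg))


def auc_category(value):
    """Rule-of-thumb label; boundary handling is an assumption (see text)."""
    if not 0.5 <= value <= 1.0:
        raise ValueError("the rule of thumb only covers AUCs from 0.5 to 1.0")
    if value < 0.7:
        return "poor"
    if value < 0.8:
        return "moderate"
    if value < 0.9:
        return "good"
    return "excellent"


# Hypothetical predicted probabilities, not data from any included study
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3])
label = auc_category(example_auc)
```

In this toy example, five of the six responder/non-responder pairs are ranked correctly, so the AUC is 5/6 and falls in the good category.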

PROBAST Risk of Bias Assessment
Results of ROB assessment using PROBAST are shown in Table 4 and Figure 2. All studies were classified as having a high ROB, mainly because all studies had problems arising in the analysis domain. Reasons for a high ROB in the analysis domain were: 11 of 14 (79%) model development studies did not account for overfitting appropriately, mostly because resampling methods were not used for all model development procedures, including predictor selection. In 12 studies (75%), the low number of participants with the outcome (pCR/GR) compared to the number of candidate predictors could lead to a high ROB. Four of fourteen (29%) model development studies did not avoid predictor selection based on univariable analysis and three (19%) studies did not handle continuous and/or categorical predictors appropriately. Moreover, there was no information about handling of missing data and calibration in 13 (81%) and 11 (69%) studies, respectively, and the complete final model formula was not given in 13 (93%) of 14 model development studies. A high risk of bias in the predictors domain was seen in three studies because of reported heterogeneity in quality of biopsies. There were only minor concerns with regard to applicability of the included studies: one study had high concerns and two studies had unclear concerns regarding applicability.

General
AUCs of all studies ranged from poor to excellent (0.51-0.93) (Figure 3). AUCs were between 0.60 and 0.91 for the outcome pCR and between 0.51 and 0.93 for the outcome GR. Apparent performance (discrimination of the model on the development cohort before internal validation) was reported in six of fourteen model development studies [51,54,56,57,59,60]. Five studies [49,54,56,57,60] described model calibration using the Hosmer-Lemeshow goodness-of-fit test, a calibration curve, or both. Calibration slopes and intercepts were not reported. Based on the predictors included in the final model, studies were grouped into the following five categories: clinical, combined (clinical, serological, and imaging), genetics, metabolites, and pathology. Apart from the combined prediction model from Buijsen et al. [50], all models used predictors from a single category or combined those predictors only with standard clinical predictors.

Figure 3. AUCs of the included studies; the x-axis represents the included studies. Apparent performance was reported in six of fourteen model development studies [51,54,56,57,59,60]. The 95% confidence intervals were reported in 10 studies [46,47,49,50,51,55,56,57,58,60]. AUC = area under the curve; GR = good response; pCR = pathological complete response.

Clinical Prediction Models
Three studies developed a prediction model using clinical predictors only [47][48][49], with poor-moderate AUCs, ranging from 0.60 to 0.70. Joye et al. [47] developed and internally validated a model in one cohort (n = 620) for the outcomes pCR and good response (GR). Following backward predictor selection, the pCR model included six (age, American Society of Anesthesiologists (ASA) score, CEA, cN stage, gender, and hemoglobin (Hb)) and the GR model four (CEA, cT stage/mesorectal fascia (MRF), cN stage, and Hb) clinical predictors. Apparent AUCs were not reported. After bootstrapping, the pCR model had an AUC of 0.60 (range 0.56-0.62) and the GR model 0.63 (range 0.62-0.64). Kim et al. [48] evaluated a model on GR and randomly split the data into a development (n = 190), tuning (n = 41), and validation (n = 41) set. The clinical model included 10 predictors: age, alcohol, ASA, BMI, diabetes, distance from anal verge, gender, hypertension, smoking, and tumor grade. AUCs for the prediction of GR were 0.65 (mean ± std: 0.02), 0.53 (mean ± std 0.08), and 0.51 (mean ± std 0.08) in the training, tuning, and validation cohort. Kim et al. also developed models with additional serological predictors measured during CRT, of which the best model had slightly better discriminative performance than the pretreatment model. The study by Ren et al. [49] evaluated a model on pCR after CRT with mFOLFOX in 126 patients that included two predictors after forward predictor selection: MRF and tumor length. The apparent AUC was not reported; after bootstrapping the AUC was 0.70 (95% CI 0.61-0.79). Model calibration performance was evaluated with a calibration curve, which seemed to indicate miscalibration in the 0.2-0.4 predicted probability range.

Combined Clinical, Serological and Imaging Prediction Model
Only one study (Buijsen et al.) [50] combined pretreatment predictors of more than two different categories in a model. Four clinical (CEA, cT stage, cN stage, and tumor length), three serological (interleukin (IL)-6, IL-8, and osteopontin), and one imaging (maximal standardized uptake value (SUVmax)) predictor(s) were selected. Data of 276 patients were used for development and internal validation of the model, predicting both pCR and GR. Apparent AUCs were not reported; bootstrapped AUCs for pCR and GR were good and moderate: 0.81 (95% CI 0.73-0.88) and 0.78 (95% CI 0.71-0.85), respectively.

Genetics Prediction Models
Four studies evaluated gene-based prediction models [51][52][53][54]. AUCs ranged widely, from poor to excellent (range 0.66-0.91). None of the genes were used in more than one model. The studies had small sample sizes (n = 24-184); only two studies used a cohort that consisted of more than 100 patients. Cho et al. [51] developed an eight-gene mRNA radio-response prediction index for GR using stepwise predictor selection in a cohort that included 184 patients, with an apparent AUC of 0.85 (95% CI 0.80-0.90) and AUCs after cross-validation that ranged from 0.81 (95% CI 0.71-0.91) to 0.91 (95% CI 0.84-0.98). Emons et al. [52] developed a 21-transcript signature for pCR prediction in a cohort of 64 patients using hill-climbing predictor selection. The apparent AUC was not reported. Several cohorts (n = 14-161) were used for internal and external validation, with AUCs between 0.70 and 0.81. The transient receptor potential channels (TRPC) score of Wang et al. [53], which reflects the expression of eight TRPC-related genes, was developed using patients with colorectal cancer for the outcome overall survival. However, in the same article, the score was externally validated in two separate rectal cancer cohorts (n = 80 and n = 85) for the outcome GR, with AUCs of 0.66 and 0.71, respectively. In the study by Wei et al. [54], four immune-related differentially expressed genes were selected using least absolute shrinkage and selection operator (LASSO) regression and combined with two clinical predictors (age and gender) in a prediction nomogram. The analyzed cohort consisted of 59 patients who received chemotherapy as either 5-FU or capecitabine with or without oxaliplatin. The apparent AUC for GR was 0.82 and the AUC after bootstrapping was 0.75. The graphically displayed calibration curve seemed to demonstrate slight miscalibration, especially in the higher predicted probability ranges.

Metabolites Prediction Models
Two studies evaluated metabolite-based models, Jia et al. (2018) and Lv et al. (2022) [46,55]. Both cohorts (n = 105 and n = 106, respectively) were from the same hospital which treated patients with one additional consolidation cycle of chemotherapy after CRT. The studies used the same methods for metabolite analysis but included different metabolites in the final models. Both used the outcome GR. Apparent AUCs were not reported. The 15-metabolite panel from Jia et al. [55] had a good AUC of 0.80 (95% CI 0.67-0.91) after bootstrapping and the eight-metabolite panel from Lv et al. [46] had a poor AUC of 0.54 (95% CI 0.43-0.65) after cross-validation. Lv et al. measured and evaluated metabolites at multiple time points; the last measurement was within two days before surgery. The use of multiple measurements improved the cross-validated discriminative performance slightly.

Pathology Prediction Models
There were six pathology-based prediction models [56][57][58][59][60][61]. AUCs were between 0.71 (moderate) and 0.91 (excellent) [56][57][58][59][60]. Jiang et al. published two collagen feature models, in 2021 [56] on GR and in 2022 [57] on pCR. Both studies used data from patients in the same three hospitals, collected between 2010 and 2018. The collagen features were analyzed in the same manner using multiphoton images. One collagen feature was used in both models. In the study published in 2021, data were randomly split in a development (n = 299) and validation (n = 129) cohort. The authors first combined three collagen features in a Collagen …
Three models evaluated the predictive value of pathological whole slide images (WSIs). Lou et al. [58] developed a deep learning model with 666, 117, and 102 participants in the training, testing, and external validation cohort, respectively. The apparent AUC was not reported. AUCs for pCR in the testing and external validation cohorts were 0.71 (95% CI 0.60-0.81) and 0.72 (95% CI 0.59-0.84). Wang et al. published a digital pathology-based deep learning model [59]. Patients in the development (n = 55) and randomly split (n = 14) cohorts were treated with induction chemotherapy (FOLFOX or CAPOX) and CRT. The evaluated model used a maximum of 500 samples from collected WSIs, which were processed and analyzed using a convolutional neural network and a graph neural network. The model showed an apparent AUC of 0.78 and an AUC of 0.73 in the randomly split cohort. Zhang et al. [60] developed a pathology signature with 17 predictors, selected from WSIs using LASSO, trained for the outcome GR. A total of 151 patients were randomly split in a development (n = 120) and validation (n = 31) dataset. A subset of 78 (52%) patients received two additional cycles of consolidation chemotherapy (capecitabine or 5-FU/leucovorin) after CRT. It was unclear how many patients with GR received consolidation chemotherapy. The apparent AUC was 0.93 (95% CI 0.88-0.97) and the AUC of the nonrandomly split cohort 0.88 (95% CI 0.72-0.97). Calibration curves showed slight miscalibration in the lower and higher predicted probability ranges, although this was not seen in the p-values of the H-L goodness-of-fit test (0.332 and 0.213).
One study (Huang et al.) [61] externally validated the immunoscore, a prediction model that uses tumor-infiltrating lymphocytes (TILs). They evaluated the score using CD3+ and CD8+ TILs in a small cohort of rectal cancer patients (n = 55). The AUC for GR was 0.72. It is important to note that the immunoscore was originally not developed to predict the response to neoadjuvant therapy but, rather, to evaluate prognostic outcomes in colorectal cancer patients who underwent surgery without neoadjuvant therapy [68,69].

Discussion
In this systematic review we summarized and critically appraised 16 validated pretreatment prediction models for response to neoadjuvant therapy in patients with rectal cancer. The models were grouped into five categories and included several promising predictors, with some models showing encouraging predictive potential. However, calibration outcomes were only reported in five studies and PROBAST indicated that all studies had a high risk of bias (ROB), mainly in the analysis domain. In addition, the majority of studies used small sample sizes, and external validation in independent studies was lacking. Based on these findings, we propose some recommendations for future research.
The two collagen feature-based models from Jiang et al. [56,57] showed the most promising results of all studies. The models reported good to excellent discriminative performance (AUCs 0.83-0.91) with relatively small confidence intervals, good calibration, and consistency across cohorts. The discriminative performance of both collagen-only signatures improved with the addition of clinical predictors, suggesting that the collagen features may be complementary to standard clinical predictors, rather than explain similar aspects of the tumor. Furthermore, multiphoton imaging for collagen feature analysis offers some advantages, such as rapid analysis using routinely collected pretreatment biopsies and a quick learning curve for newly trained pathologists [56,57,70,71]. In addition, comparable pretreatment collagen feature models showed good and excellent discrimination for prognostic outcomes in colon cancer (n = 882, disease-free survival/overall survival) and gastric cancer (n = 375, lymph node metastasis), respectively [71,72]. Thus, collagen feature-based prediction models are good potential candidates for further (external) validation, despite the high ROB of the included studies.
The pathological whole slide image (WSI)-based models [58][59][60], immunoscore [61], and metabolite-based models [46,55] reported poor to excellent discriminative performance (AUCs 0.54-0.93). Due to the heterogeneous results, small sample sizes, and high ROB, it is challenging to determine their exact predictive value. Consequently, we cannot confidently recommend these models for further validation. Nevertheless, the included predictors remain interesting for future research. Digital WSIs of pretreatment biopsies can be used to quantify pathological features [59,73]. WSI analysis has the potential to be easily incorporated in routine clinical practice, since standard rectal cancer biopsies can be used and the annotation workload for the pathologist is significantly reduced with new analysis and modelling methods [59,60,74]. Metabolomics is defined as the analysis of intermediate or end products of metabolic processes (metabolites) [75]. It is a minimally invasive method for response prediction that relies on easily obtained samples, with serum as the preferred sample for analysis, but the literature on its role in predicting treatment response in patients with rectal cancer is scarce [76]. The immunoscore was originally developed [68] and later validated [69] for prognostic outcomes in colorectal cancer patients undergoing surgery without neoadjuvant therapy. The score combines the densities of two tumor-infiltrating lymphocyte (TIL) populations at the center of the tumor and at the invasive margin. In addition to the included study from Huang et al. [61], El Sissy et al. [77] validated the pretreatment immunoscore in a cohort of 249 rectal cancer patients and showed a positive correlation of the immunoscore with tumor response and significantly higher pCR rates in patients with a high immunoscore. This study was excluded from data extraction and ROB assessment in the present review because tumors other than adenocarcinoma were included. The immunoscore has strong interobserver reproducibility [69] but the exact value of the immunoscore for pretreatment response prediction remains to be determined.
The four gene-based models [51][52][53][54] showed widely varying discriminative performance, ranging from poor to excellent (AUCs 0.66-0.91). Small sample sizes were used, a common issue in rectal cancer gene-based prediction model studies [30,31,78,79]. A second frequently occurring problem is the lack of overlap between genes in different studies [51,80]. None of the genes from the models included in this review were used in more than one model. Moreover, Cho et al. [51] compared their eight-mRNA gene model with 11 other gene expression studies and found that only one of their eight mRNA genes showed predictive value in other studies. Furthermore, a systematic review by Izzotti et al. identified 674 mRNAs and 77 microRNAs with potential predictive value for response to neoadjuvant therapy in colorectal cancer patients, of which only 19 mRNAs and 6 microRNAs were differentially expressed in more than one study [81]. The results of the gene-based models included in this review confirm the need for larger sample sizes and for genes that show consistent predictive value across multiple prediction models.
We did not include radiomics-based and image-based deep learning models in this review because recent reviews of these models are already available. Two systematic reviews evaluated the predictive value of pretreatment radiomics-based models. Staal et al. evaluated 24 studies [32] and Di Re et al. 9 studies [34]. AUCs ranged from 0.47 to 0.99. A meta-analysis from 2022 by Jia et al. [33] evaluated 21 MRI radiomics-based models and image-based deep learning models. The pooled AUC of the validation cohorts was excellent (0.92 with 95% CI: 0.88-0.93), although a small number of studies also included models with both pre- and post-treatment predictors. As shown by these systematic reviews, the number of radiomics-based and image-based deep learning models is high and good to excellent AUCs were often reported. However, the models suffer from similar problems as those included in the current review, such as a high ROB, reproducibility issues, and a lack of external validation [32,34]. Consequently, radiomics-based models and image-based deep learning models are not ready for clinical practice, pending external validation in large multicenter prospective trials.
PROBAST indicated a high ROB in all studies, mainly because all studies had a high ROB in the analysis domain. We obtained these results despite only selecting studies that reported some form of internal or external validation. The results are in line with other systematic reviews that use PROBAST, as shown in a recent meta-review of 50 systematic reviews that used PROBAST. Included studies were often considered to have a high or unclear ROB, mostly because of problems in the analysis domain [82]. The negative effect of a high ROB was quantified by Venema et al. [83], who rated 102 prediction models with a shortened form that included six PROBAST items. The median decrease in discrimination (AUC) between development and validation cohorts was significantly larger in high ROB models (−11.7%) than in low ROB models (−0.9%). Large sample sizes that include a sufficient number of participants with the outcome are critical to minimize ROB. Twelve of sixteen (75%) included studies had a low number of participants with the outcome (pCR/GR) compared to the number of candidate predictors. A sample size calculation can be helpful to identify the maximum number of candidate predictors or to identify a dataset that is too small for a particular research question [84]. A problem such as the effect of not properly accounting for overfitting, which was seen in most included development studies, might even be negligible if a dataset is used with a sufficient number of participants with the outcome [83]. Unfortunately, even in large clinical rectal cancer trials, it is difficult to include a reasonable number of patients with pCR or GR, let alone patients with a (near) cCR. Alternatively, data collection in prospective observational cohort studies may increase sample sizes.
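To make the relation between outcome events and candidate predictors concrete, the following minimal Python sketch computes the number of events per candidate predictor and the relative change in discrimination between a development and a validation cohort. It is an illustration under stated assumptions, not the formal sample size calculation referred to above [84]: the function names are hypothetical, the threshold of 10 events per predictor is a conventional rule of thumb that we assume here, the relative change is expressed as the change in AUC above chance level (0.5), which may differ from the definition used in [83], and all numbers are invented for illustration.

```python
from math import floor


def events_per_predictor(n_events: int, n_candidate_predictors: int) -> float:
    """Events per candidate predictor.

    n_events is the number of participants in the smaller outcome group
    (for example, patients with pCR).
    """
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return n_events / n_candidate_predictors


def max_candidate_predictors(n_events: int, min_epp: float = 10.0) -> int:
    """Largest number of candidate predictors that keeps EPP >= min_epp.

    min_epp = 10 is an assumed rule-of-thumb threshold, not a value
    taken from the review.
    """
    return floor(n_events / min_epp)


def relative_auc_change(auc_dev: float, auc_val: float) -> float:
    """Percent change in discrimination above chance (AUC - 0.5).

    Assumed definition for illustration; negative values mean that
    discrimination dropped in the validation cohort.
    """
    return 100.0 * ((auc_val - 0.5) - (auc_dev - 0.5)) / (auc_dev - 0.5)


if __name__ == "__main__":
    # Hypothetical cohort: 200 patients, 30 with pCR, 12 candidate predictors.
    print(events_per_predictor(30, 12))      # events per candidate predictor
    print(max_candidate_predictors(30))      # predictors supported at EPP >= 10
    print(relative_auc_change(0.85, 0.78))   # percent change, development to validation
```

In this hypothetical cohort, 30 events spread over 12 candidate predictors fall well below the assumed threshold, which is the kind of imbalance seen in most of the included studies.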
Our study has some limitations. We could have missed interesting studies by excluding those that included patients with mucinous, signet-ring cell, or neuroendocrine tumors. However, these tumors may respond differently to neoadjuvant therapy than standard adenocarcinomas because of distinct tumor biology [85][86][87][88], which would have led to more heterogeneous results. Moreover, by excluding radiomics-based and image-based deep learning models, we might have missed potential studies that combine these models with predictors from other categories. Even so, we believe that most relevant models were included in the reviews of Di Re, Staal, and Jia [32][33][34], since these reviews also included models that added predictors from different categories.
The substantial ROB of selected studies in this review prevents the implementation of these models in a clinical setting. However, our findings indicate that some pretreatment prediction models, in particular the collagen-feature models and perhaps the genetics-, metabolites-, and pathology-based models, could have important clinical value in the future. Several recommendations can be given to guide future research on the additional clinical value of these models. The models should be further developed and externally validated in independent cohorts with a sufficient sample size, and appropriate statistical plans should be designed to minimize bias. Furthermore, easily obtainable predictors should be used that are measured with standardized methods to allow reproducibility of the results. In addition, the models need to be validated for organ preservation outcomes instead of postoperative outcomes given the conflicting results that exist regarding the concordance between cCR and pCR [9,89,90]. Finally, to improve generalizability, investigated patient populations should also include patients with early tumor stages and neoadjuvant treatment regimens other than CRT. These steps might enable us to select the most promising models that should be evaluated in the clinic and discard those models that are not worthy of further investigation.

Conclusions
This systematic review aimed to summarize and critically appraise validated pretreatment prediction models for response to neoadjuvant therapy in patients with rectal cancer and provide evidence-based recommendations for future research. Results were heterogeneous, all studies were considered to have a high risk of bias, and external validation in independent studies was lacking. Despite these limitations, several promising predictors were identified, and some models demonstrated encouraging predictive potential, in particular the collagen feature-based models. These studies could form the basis for future research, which should focus on the reduction of bias, evaluate reproducible predictors, and use outcomes specifically tailored to organ preservation. Furthermore, it is crucial that these models undergo external validation in independent studies in order to reach clinical value.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.